In [1]:
import csv
import nltk
import math
import collections
from textblob import TextBlob
from pprint import pprint
In [2]:
csvfile = open('bernie-sanders-announces.csv','r')
reader = csv.reader(csvfile)
data = []
for line in reader:
    line[3] = line[3].decode('utf-8')  # comment text column: decode from UTF-8 bytes
    data.append(line)
In [3]:
len(data)
Out[3]:
In [4]:
data[0]
Out[4]:
In [5]:
data[1]
Out[5]:
In [6]:
comment_text = data[1][-1]
In [7]:
comment_text
Out[7]:
In [8]:
comment_text[0]
Out[8]:
In [9]:
comment_text[2:6]
Out[9]:
In [10]:
comment_text + comment_text
Out[10]:
In [11]:
# tab complete
comment_text.split()
Out[11]:
In [12]:
split_on_questions = comment_text.split('?')
split_on_questions
Out[12]:
In [13]:
for string in split_on_questions:
    print string.strip()
In [14]:
cleaned = [s.strip().lower() for s in split_on_questions]
cleaned
Out[14]:
In [15]:
'?!?! '.join(cleaned)
Out[15]:
In [16]:
'Hilary' in data[80][-1]
Out[16]:
In [17]:
clinton_count = 0
for row in data:
    if 'Hilary' in row[-1] or 'Clinton' in row[-1]:
        clinton_count += 1
clinton_count
Out[17]:
In [18]:
blob = TextBlob(data[80][-1])
blob
Out[18]:
In [19]:
blob.sentences
Out[19]:
In [20]:
blob.words
Out[20]:
In [21]:
blob.tokens
Out[21]:
In [22]:
blob.noun_phrases
Out[22]:
In [23]:
blob.word_counts
Out[23]:
In [27]:
word_count = collections.Counter(blob.word_counts)
In [28]:
word_count.most_common(5)
Out[28]:
In [29]:
stopwords = nltk.corpus.stopwords.words('english')
In [30]:
nltk.download()  # opens the NLTK downloader so corpora like the stopwords list can be fetched
Out[30]:
In [31]:
# remove stopwords and very short words from the counts
for key in word_count.keys():
    if key in stopwords or len(key) <= 2:
        del word_count[key]
In [32]:
word_count.most_common(5)
Out[32]:
We could keep adding stopwords to try to make these keywords better, but it's kind of like playing whack-a-mole.
An additional solution to the problem: add a new term to our "representative-ness" measure that accounts for the overall rarity of the word across the whole corpus.
$$\frac{n_w}{N}$$
where $n_w$ is the number of documents containing word $w$, and $N$ is the total number of documents.
But we want a potential keyword to have a lower score if it is common in the corpus and a higher score if it is rarer, so we flip it:
$$\frac{N}{n_w}$$
It's also common to take the log of this to reduce the amount of disparity between extremely common and extremely uncommon terms:
$$\log\frac{N}{n_w}$$
This is called IDF, or Inverse Document Frequency. Let's calculate it for all the words in our comment dataset!
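Before running this on the real comments, here is a minimal sketch of the idea using a made-up three-document corpus (toy_docs is invented purely for illustration):
toy_docs = ['bernie announces campaign',
            'bernie talks about wall street',
            'cats are great']
toy_N = float(len(toy_docs))
toy_counts = collections.Counter()
for doc in toy_docs:
    toy_counts.update(set(doc.split()))   # count each word at most once per document
for word, n_w in toy_counts.items():
    print word, math.log(toy_N/n_w)
# 'bernie' appears in 2 of 3 documents -> log(3/2) ~ 0.41 (common, low score)
# 'cats' appears in 1 of 3 documents   -> log(3/1) ~ 1.10 (rare, high score)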
In [33]:
N_documents = float(len(data))
word_document_counts = collections.Counter()  # how many documents each word appears in
word_idf = {}  # will hold the IDF score for each word
In [34]:
# count, for each word, how many comments it appears in (skipping the header row)
for row in data[1:]:
    blob = TextBlob(row[-1].lower())
    words = blob.word_counts.keys()
    word_document_counts.update(words)
In [35]:
# IDF: log of (total documents / documents containing the word)
for key, val in word_document_counts.iteritems():
    word_idf[key] = math.log(N_documents/val)
For each word $w$ in a given document $D$, we can take the term frequency
$$\frac{D_w}{W_D}$$
where $D_w$ is the number of occurrences of word $w$ in document $D$ and $W_D$ is the total number of words in document $D$,
and multiply it by the word's IDF that we just calculated. This gives TF-IDF scores, and the words with the highest scores are likely to be good representatives of that document.
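For example, with made-up numbers: a word that appears 3 times in a 60-word comment and shows up in 50 of 1,000 comments overall would score
$$\frac{3}{60}\cdot\log\frac{1000}{50} = 0.05\cdot\log 20 \approx 0.15$$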
In [36]:
comment = data[80][-1]
blob = TextBlob(comment.lower())
num_words_in_comment = len(blob.words)
word_count = blob.word_counts
tf_scores = {}
# term frequency: each word's count divided by the total number of words in the comment
for word, count in word_count.items():
    if word not in stopwords and len(word) > 2:
        tf_scores[word] = float(count)/num_words_in_comment
In [37]:
tf_idf = {}
# multiply each word's term frequency by its IDF, then look at the top 5 scores
for word, tf in tf_scores.items():
    tf_idf[word] = tf*word_idf[word]
sorted(tf_idf.iteritems(), key=lambda k: k[1], reverse=True)[:5]
Out[37]:
Note that TF-IDF can be tweaked in lots of other ways if you aren't getting good results.
It can also be done with "n-grams": phrases that are n words long, which lets you capture multi-word phrases like "gay rights" or "hillary clinton".
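TextBlob can produce n-grams directly with its ngrams() method. As a rough sketch (the sentence below is made up), the document-frequency counting and scoring would then work the same way as before, just keyed on the bigram instead of the single word:
bigram_blob = TextBlob('hillary clinton supports gay rights')
for bigram in bigram_blob.ngrams(n=2):
    print ' '.join(bigram)   # hillary clinton, clinton supports, supports gay, gay rights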
In [38]:
from nltk.stem.porter import PorterStemmer
In [39]:
stemmer = PorterStemmer()
# the Porter stemmer chops related words down to a common root form
print stemmer.stem('political')
print stemmer.stem('politics')
print stemmer.stem('politician')
In [40]:
from nltk.text import Text
tokens = TextBlob(data[80][-1]).tokens
text_object = Text(tokens)
text_object.concordance('Hilary')
In [87]:
blob = TextBlob(data[41][-1])
blob
Out[87]:
In [88]:
blob.sentiment
Out[88]:
In [90]:
blob.sentences[1].sentiment
Out[90]: